Spontaneous Speech Recognition Using Visual Context-Aware Language Models
Authors
Abstract
This thesis presents Fuse, a novel situationally-aware multimodal spoken language system that performs speech understanding for visual object selection. An experimental task was created in which people were asked to refer, using speech alone, to objects arranged on a tabletop. During training, Fuse acquires a grammar and vocabulary from a “show-and-tell” procedure in which visual scenes are paired with verbal descriptions of individual objects. Fuse determines a set of visually salient words and phrases and associates them with a set of visual features. Given a new scene, Fuse uses the acquired knowledge to generate class-based language models conditioned on the objects present in the scene, as well as a spatial language model that predicts the occurrences of spatial terms conditioned on target and landmark objects. The speech recognizer in Fuse uses a weighted mixture of these language models to search for more likely interpretations of user speech in the context of the current scene. During decoding, the weights are updated using a visual attention model that redistributes attention over objects based on partially decoded utterances. The dynamic situationally-aware language models enable Fuse to jointly infer the spoken utterances underlying speech signals and the identities of the target objects they refer to. In an evaluation of the system, visual situationally-aware language modeling yields a significant decrease, of more than 30%, in speech recognition and understanding error rates. The ideas of situation-aware speech understanding developed in Fuse may be applied in numerous areas, including assistive and mobile human-machine interfaces.
Thesis Supervisor: Deb K. Roy
Title: AT&T Career Development Professor, Media Arts and Sciences
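The weighted mixture and the attention update described in the abstract can be illustrated with a minimal sketch. This is not Fuse's implementation: the per-object unigram tables, the object names, the `floor` smoothing value, and the Bayes-style reweighting rule are all assumptions chosen for illustration, whereas the thesis describes class-based and spatial language models.

```python
from typing import Dict

# Hypothetical per-object unigram language models for one scene.
# Fuse itself uses class-based and spatial models; these toy tables
# are assumptions made only to keep the sketch small.
MODELS: Dict[str, Dict[str, float]] = {
    "red_ball": {"the": 0.4, "red": 0.3, "ball": 0.3},
    "green_block": {"the": 0.4, "green": 0.3, "block": 0.3},
}


def mixture_prob(word: str, weights: Dict[str, float],
                 models: Dict[str, Dict[str, float]],
                 floor: float = 1e-6) -> float:
    """Mixture probability: P(word) = sum_i w_i * P_i(word)."""
    return sum(w * models[obj].get(word, floor) for obj, w in weights.items())


def update_attention(word: str, weights: Dict[str, float],
                     models: Dict[str, Dict[str, float]],
                     floor: float = 1e-6) -> Dict[str, float]:
    """Assumed update rule: w_i is rescaled by P_i(word), then renormalized."""
    scored = {obj: w * models[obj].get(word, floor) for obj, w in weights.items()}
    total = sum(scored.values())
    return {obj: s / total for obj, s in scored.items()}


def decode_attention(words, models):
    """Start from uniform attention and update it after each decoded word."""
    weights = {obj: 1.0 / len(models) for obj in models}
    for word in words:
        weights = update_attention(word, weights, models)
    return weights


if __name__ == "__main__":
    print(decode_attention(["the", "red"], MODELS))
```

In this toy scene, the word "the" leaves attention uniform because both models assign it the same probability, while "red" shifts nearly all of the weight to `red_ball`, so the mixture then favors words that describe that object.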
Similar resources
Allophone-based acoustic modeling for Persian phoneme recognition
Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighboring phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...
Modeling Speech Repairs and Intonational Phrasing to Improve Speech Recognition
The spontaneous speech events of speech repairs and intonational phrasing cause disruptions in the local context, and this disruption prevents traditional language models from being able to properly predict the words in the vicinity of these events. The solution is to use a language model that can account for these spontaneous speech events. In this paper, we use such a model to rescore word gr...
Rapid Language Model Development for New Task Domains
Data sparseness has been regularly indicted as the primary problem in statistical language modelling. We go one step further to consider the situation when no text data is available for the target domain. We present two techniques for building efficient language models quickly for new domains. The first technique is based on using a context-free grammar to generate a corpus of word collocations...
Acoustic modeling for spontaneous speech recognition using syllable dependent models
This paper proposes a syllable context dependent model for spontaneous speech recognition. It is generally assumed that, since spontaneous speech is greatly affected by coarticulation, an acoustic model featuring a longer range phonemic context is required to achieve a high degree of recognition accuracy. This motivated the authors to investigate a tri-syllable model that takes differences in t...
Improved Bayesian Training for Context-Dependent Modeling in Continuous Persian Speech Recognition
Context-dependent modeling is a widely used technique for better phone modeling in continuous speech recognition. While different types of context-dependent models have been used, triphones are known to be the most effective. In this paper, a Maximum a Posteriori (MAP) estimation approach has been used to estimate the parameters of the untied triphone model set used in data-driven clust...